- 1 Introduction
- 2 Software
- 3 Experiment setting
- 4 Data and preprocessing
- 5 Observation Model and Functional Analysis
- 6 Dimensionality Reduction & Smoothing
- 7 Learning the graph topology
- 8 Semi-Supervised Clustering Based on Signed Total Variation
- 9 Classification
- 10 Conclusions and possible improvements
1 Introduction
With this work we propose a model to recognize the speaker and the place of registrations token using Science Journal app. These registrations are made of various kinds of environment sensing implemented through the mobile’s sensors. We faced the task using most of the main tools studied during the Statistical Learning course; in particular, we proceeded with some fundamental and essential guidelines:
Sensing is, obviously, time dependent and we have to take this into account, so the basis of the work is Functional Analysis, in particular basis expansion.
Basis expansion means, in general, that there will be a huge number of features (coefficients) in the dataset, so another keyword is Dimensionality Reduction.
ML problems and techniques are constantly evolving and we thought a good idea is using State-of-the-Art methods. In particular, here we propose a graph-based semi-supervised clustering algorithm. The paper poster can be freely read here.
All of these guidelines will be exploited and explained in the next paragraphs.
2 Software
To implement the model we used three different software (based on different programming languages):
Jupyter Notebook (python based).
RStudio.
Matlab
The choice of not using a single tool was based on the fact that each one of these has its workhorses that have been useful for us to implement different things.
3 Experiment setting
The model has to recognize three speakers (Claudio, Lorenzo and Egon) and three places (House, Street, Subway). Each registration grabbed with Science Journal recorded the pronunciation of the phrase “Vaga in vari mari” with one second of silence before it and one second of silence after it to allow us to make all the registrations of the same duration.
4 Data and preprocessing
The first problem to face was clear as soon as we saw the registrations:
| timestamp | AmbientLightSensor | DecibelSource | PitchSensor | AccX | AccY |
|---|---|---|---|---|---|
| 1.560265e+12 | NA | 24.06718 | NA | NA | NA |
| 1.560265e+12 | NA | 22.97557 | NA | NA | NA |
| 1.560265e+12 | NA | 24.50082 | NA | NA | NA |
| 1.560265e+12 | NA | 24.04070 | 0 | NA | NA |
| 1.560265e+12 | NA | 23.47812 | NA | NA | NA |
| 1.560265e+12 | 83 | NA | NA | NA | NA |
| 1.560265e+12 | NA | NA | NA | -1.388 | -2.174 |
| 1.560265e+12 | NA | 23.62310 | NA | NA | NA |
| 1.560265e+12 | 82 | NA | NA | -1.216 | -2.154 |
| 1.560265e+12 | 82 | NA | NA | NA | NA |
| 1.560265e+12 | NA | 24.81699 | NA | NA | NA |
| 1.560265e+12 | NA | NA | NA | -1.436 | -1.618 |
| 1.560265e+12 | 81 | 23.08598 | 0 | NA | NA |
| 1.560265e+12 | NA | 25.02964 | NA | NA | NA |
| 1.560265e+12 | NA | NA | NA | -1.149 | -2.260 |
| 1.560265e+12 | NA | 23.82970 | NA | NA | NA |
| 1.560265e+12 | NA | 23.87517 | NA | NA | NA |
| 1.560265e+12 | NA | NA | NA | NA | NA |
| 1.560265e+12 | 80 | NA | NA | NA | NA |
| 1.560265e+12 | NA | 23.21267 | 0 | NA | NA |
The sampling period is not equal for all sensors and consequentially NAs are present. The “imputation” in this case is, however, easier than the case of i.i.d registrations: these are functions of time describing classical physics phenomena, so we just need a decent way to interpolate them. Of course there are many techniques to do this but, thanks to the fact that the timestamp (and so the sampling periods) are really short, we decided to use a simple linear interpolator.
Then we had to cut off some observations for each registration in order to guarantee that all the registrations last the same time and we’re pretty sure we were not missing information thanks to the set up of the experiment explained above.
At this point, each registration was complete and of uniform size. The next step was creating the final dataset by collecting all the registrations in a single data frame indexed by speaker, place and sensor type:
| X1 | location | guy | sensor | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | casa | claudio | AmbientLightSensor | 80.0000000 | 80.0000000 | 80.500000 |
| 1 | casa | claudio | DecibelSource | 24.6802407 | 24.5645959 | 24.448951 |
| 2 | casa | claudio | PitchSensor | 20.2978516 | 27.0638021 | 33.829753 |
| 3 | casa | claudio | LinearAccelerometerSensor | 1.5938633 | 1.5612082 | 1.528553 |
| 4 | casa | claudio | AccX | -0.7565000 | -0.7660000 | -0.775500 |
| 5 | casa | claudio | AccY | -1.8496667 | -1.9183334 | -1.987000 |
| 6 | casa | claudio | AccZ | 10.6110001 | 10.6110001 | 10.611000 |
| 7 | casa | claudio | CompassSensor | 97.4488019 | 97.4488019 | 97.448802 |
| 8 | casa | claudio | MagneticRotationSensor | 31.1448230 | 31.1448230 | 31.144823 |
| 9 | casa | claudio | AmbientLightSensor | 33.1250000 | 33.0000000 | 33.000000 |
| 10 | casa | claudio | DecibelSource | 67.6226551 | 71.7784598 | 75.934265 |
| 11 | casa | claudio | PitchSensor | 63.9766715 | 69.3080608 | 74.639450 |
| 12 | casa | claudio | LinearAccelerometerSensor | 0.8420951 | 0.8447015 | 0.847308 |
| 13 | casa | claudio | AccX | -0.6320000 | -0.6000000 | -0.568000 |
| 14 | casa | claudio | AccY | -1.7996667 | -1.7613334 | -1.723000 |
| 15 | casa | claudio | AccZ | 9.8160000 | 9.7840001 | 9.752000 |
| 16 | casa | claudio | CompassSensor | 161.6595537 | 161.6783185 | 161.697083 |
| 17 | casa | claudio | MagneticRotationSensor | 84.3326746 | 84.3326746 | 84.332675 |
| 18 | casa | claudio | AmbientLightSensor | 32.0000000 | 32.0000000 | 32.000000 |
| 19 | casa | claudio | DecibelSource | 21.5215097 | 23.1700584 | 22.769880 |
All this preprocessing step were done using Jupyter due to Pyton versatility.
Just to take a look, let’s plot one of the sound intensity registrations:
At this point we normalized the numerical variables (so the functions images) to get an isotropic sample and label encoded the categorical ones (i.e.: “Claudio” \(\rightarrow\) 1, “Lorenzo” \(\rightarrow\) 2). Finally we had our (starting) dataset.
5 Observation Model and Functional Analysis
The observations are records of the same lengths \(\Delta T\). Each sensor will return (almost) equispaced samples of the corresponding continuous physical quantity. We assumed an isotropic Many Normal Means Model (MNMM) for the observations. More formally, the \(i-th\) observation of the sensor \(j\) can be expressed as:
\[ y_{j,i} = f_j(i) + \epsilon_i, \qquad \text{with} \quad \epsilon_i \sim N(0,\sigma_j^2), \quad j \in\{1,...,N\} \quad (1) \]
Here \(i \in \{0,1,...,\Delta T \cdot F_s \}\) is a shortcut to indicate the sampling period \(i\cdot \frac{1}{F_s}\) where \(F_s\) is the sampling frequency and \(N\) is the number of sensors.
Once we got the samples, we were firstly interested in recover \(f_j(\cdot)\) from them. To do this and thanks to the fact that all the records are of the same duration \(\Delta T\) we could just map the sampling times set \(\mathbb{S} :=\{0,\frac{1}{F_s},2\cdot \frac{1}{F_s},...,\Delta T\}\) in \(\mathbb{S}_n := \{0,\frac{1}{m},...,1\}\) where \(m = |\mathbb{S}|\) without loss of generality.
At this point, it was legit (due to their intrinsic physic nature) to assume that the functions of interest belong to \(L_2([0,1])\) and we could use all the theory we know about it to estimate and clean the corresponding Generalized Fourier Coefficients modeling an expansion over an orthonormal basis for the space. In particular, given as known all the theory about GFC, we know we can estimate in general, \(m\) GFCs with \(m\) observations, such that our function estimator become:
\[ \hat{f}_j(x) \approx \sum_{i=1}^m [\tilde{\alpha}_j]_i \phi_i(x), \quad j = 1,...,N \]
where \([\tilde{\alpha}_j]_i = \frac{1}{m}<\mathbf{y}_j, \phi_i> = \frac{1}{m} \sum_{k=1}^m [\mathbf{y}_j]_k \phi_j(k)\) where \(k\) has the same meaning of \(i\) in \((1)\) and this raw GFCs are asymptotically unbiased if the time instants are equispaced as in our case. As usual, we could generalize the previous formula in matrix notation for all sensors as:
\[\tilde{\alpha}_j=\frac{1}{m}\Phi\mathbf{y}_j\]
where \([\Phi]_{jk} = \phi_j(k)\).
We decided to use the cosine basis to project the observations on. However, before this, we realized it was useless to project each single observation and this because the chosen classification model, as the reader will see, is graph-based and we needed a unique coordinates set per each registration to learn the graph topology.
So we had to think about how many sensors take into account and how to combine them. Considering only the “Sound Intensity (dB)” (\(\mathbf{y}_1\)) and the “Tonality (Hz)” (\(\mathbf{y}_2\)) and taking as new observations a convex combination of them seemed us a good choice. So, just to give the reader the sense of the workflow of the project, at this point each registration is uniquely identified by a set of observations obtained as:
\[\mathbf{g}=\beta \mathbf{y}_1+(1-\beta)\mathbf{y}_2 \approx g(\cdot) = \beta f_1(\cdot)+(1-\beta)f_2(\cdot), \quad \beta \in [0,1]\]
We chose \(\beta = .5\). At this point the dataset looks like:
| V195 | V196 | V197 | V198 | V199 | V200 | V201 | V202 |
|---|---|---|---|---|---|---|---|
| 0.5208995 | 0.6471316 | 0.7733638 | 0.8041914 | 0.8196940 | 0.8351967 | 1 | 1 |
| 0.0483562 | 0.0153240 | -0.0177081 | 0.0108282 | 0.0018937 | -0.0070408 | 1 | 1 |
| -0.1184372 | -0.1182831 | 0.0699706 | 0.2582243 | 0.2243966 | 0.2412886 | 1 | 1 |
| 0.0620739 | 0.0192037 | -0.0236664 | -0.0172280 | -0.0427814 | -0.0477524 | 1 | 1 |
| 0.3863144 | 0.3875223 | 0.3887301 | 0.4021117 | 0.4025329 | 0.4029541 | 1 | 1 |
| -0.3375167 | -0.3296564 | -0.3217960 | -0.3139357 | -0.2754922 | -0.2370487 | 1 | 1 |
| 0.0193384 | 0.0043150 | -0.0208629 | -0.0275620 | -0.0342611 | -0.0420476 | 1 | 1 |
| 0.1104831 | 0.0598716 | -0.3937332 | -0.2842210 | -0.1747088 | -0.0651965 | 1 | 1 |
| 0.0098740 | 0.0207646 | -0.0077667 | -0.0362980 | -0.0648293 | -0.0933607 | 1 | 1 |
| 0.0616137 | 0.0534310 | 0.0163925 | -0.0489280 | -0.0316645 | -0.0144010 | 1 | 1 |
| 0.4190474 | 0.4131426 | 0.4020458 | 0.3909491 | 0.3798523 | 0.3085279 | 1 | 1 |
| -0.1166068 | -0.1233338 | -0.1300608 | -0.1372210 | -0.1443811 | -0.0409733 | 1 | 1 |
| -0.1747848 | -0.1705684 | -0.1554143 | -0.1402602 | -0.1251061 | -0.1354940 | 1 | 1 |
| 0.0547581 | 0.0185215 | -0.0040442 | -0.0266100 | -0.0796416 | -0.0672566 | 1 | 1 |
| 0.6302543 | 0.6569402 | 0.5627311 | 0.4685220 | 0.4119227 | 0.4056838 | 1 | 1 |
| 0.0538830 | 0.0199179 | -0.2256919 | -0.1808349 | -0.1710996 | -0.1613642 | 1 | 1 |
| -0.1622574 | -0.1428959 | -0.1235343 | -0.1041727 | -0.0971946 | -0.0902166 | 1 | 1 |
| 0.0350673 | 0.0373273 | 0.0395872 | -0.1612555 | -0.1040341 | -0.0468127 | 1 | 1 |
| 0.2757604 | 0.1962648 | 0.1167691 | 0.0372735 | -0.0422221 | -0.1217178 | 1 | 1 |
| -0.8749670 | -0.9873217 | -1.0996764 | -1.2120312 | -1.3243859 | -1.3510310 | 2 | 1 |
Where the last two columns are the encoding of speakers and places.
Finally we can project this new observations on the cosine basis.
6 Dimensionality Reduction & Smoothing
Once we got the coefficients for all the observations, we decided to cut off some of them to smooth the initial functions by minimizing the total regret. We obtained as cutoff \(\hat{J} = 57\).
This \(57\) coefficients per each registration will be the coordinates set in \(\mathbb{R}^{57}\) in the graph identifying the speaker.
We can take a look to the smoothing effect:
Instead, to identify the places, we chose to use the last \(25\)% of the coefficients, because we thought that the place discriminant is the noise and, as usual, the noise is related to high frequency components, so the highest indices coefficients.
At this point we had two datasets with different coefficients, one for places and one for speakers.
Why using the vectorial space in which coefficients live is a good idea will be clearer in the next paragraph.
The previous smoothing and dimensionality reduction steps are implemented using RStudio.
7 Learning the graph topology
Our objective is encoding in the graph topology a concept of pairwise distance among observations. A good choice (that is a proper distance in \(L_2([0,1])\)) is (i.e. between registrations \(i\) and \(j\)):
\[d(i,j) = ||g_i(t) - g_j(t)||_{L_{2}}^2 = \int_{[0,1]}(g_i(t) - g_j(t))^2 dt\]
Thanks to the fact that we’re using an orthonormal basis, the Parseval identity holds, so:
\[d(i,j) = ||g_i(t) - g_j(t)||_{L_2}^2 = ||\alpha_i - \alpha_j||_2^2 \approx ||\tilde{\alpha}_i - \tilde{\alpha}_j||_2^2 = \sum_{k=j}^J ([\tilde{\alpha}_i]_k - [\tilde{\alpha}_j]_k)^2 \]
Where \(j\in \{1,\text{ceil}(.75\cdot m)\}\) and \(J \in \{\hat{J},m\}\) depending on the dataset.
This is why makes sense considering usual Euclidean norm among vector of coefficients to learn the graph.
Given this, we learn the two graphs using KNN with the first 10 nearest neighbors.
This step is implemented using again Jupyter.
Let’s take a look at the two graphs with the highlighted ground-truth clusters (obvs. coordinates are meaningless, edges matter):
Places Graph
Speakers Graph
8 Semi-Supervised Clustering Based on Signed Total Variation
The algorithm (Berger ’18) description and main concepts, again, are here [2]. Follows, anyway, the poster:
From now we consider the reader has read the poster.
The novelty of the work is inserting a concept of dissimilarity and not only similarity in the graph structure. This is combined with a slightly older technique that is based on Total Variation minimization. We can say that the algorithm mantra is “cluster in a way that assumes linked nodes are likely to be in the same cluster but paying attention to the labeled nodes and to dissimilarities among them”.
We wanna show that, after unlabeling the most of the registrations (nodes), the algorithm can cluster them with good results.
This is a 2 class clustering algorithm, so we adopted a one vs all strategy, training a single classifier (clustering) per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.
The decision in ambiguous cases is made by choosing the class with the highest score. In our case, we decided that the score is the maximum among the optimal values that the algorithm outputs per each class minimizing the problem 3 in [2].
The classes are balanced and the accuracy is a good measure.
The algorithm has been implemented from scratch using Matlab (better solvers).
The algorithm parameters are:
- \(L = L_1 + L_{-1}\) : the number of labeled nodes at each 1 (“1”) vs Others (“-1”) iteration.
- \(M\): the number of dissimilarities among \(L_1\) and \(L_{-1}\) nodes. The dissimilarity between nodes \(i\) and \(j\) is weighted with \(-\sqrt{d(i,j)}\) where \(d(i,j)\) is again the euclidean distance among the correspondent coefficients.
We splitted the dataset in a training set (about 50 registrations per class) and a test set (useful later) and we worked on the training set.
This is the set up of our simulations:
- 139 nodes.
- 3 classes per each dataset.
In particular, for Semi-supervised speaker clustering:
- \(L_1 = 20 \, ( /(50 \pm \epsilon)\)
- \(L_{-1} = 35 \, ( /(150 \pm \epsilon)\)
- \(M = 3\) randomly inserted between \(L_1\) and \(L_{-1}\)
Reached accuracy: \(91\)%.
Instead, for Semi-supervised places clustering:
- \(L_1 = 20 \, ( /(50 \pm \epsilon)\)
- \(L_{-1} = 35 \, ( /(150 \pm \epsilon)\)
- \(M = 6\) randomly inserted between \(L_1\) and \(L_{-1}\)
Reached accuracy: \(82\)%.
Obviously, but just to remember it, the accuracy grows increasing \(L\). The little fluctuations are due to the randomness injected by the random labeling:
9 Classification
Once the cluster are made, what happen if a new registration has to be classified?
Our simple idea is using a KNN classifier, so the new registration is classified with the most frequent class of its \(k\) nearest neighbors (again among GFCs) that, given the way we learn the graph, is equivalent to say we’re assigning it to the most frequent class of its adjacent nodes.
We chose the \(k\) the maximizes the accuracy on the test set. In particular:
- Speaker classification:
Reached accuracy: \(88\)% with \(k=3\)
- Speaker classification:
Reached accuracy:\(83\)% with \(k=3\)
Fluctuations for the same reason mentioned above. We can live with them, we ensure the results are always the same. Clearly MC simulation was the theoretical alternative.
10 Conclusions and possible improvements
The results are pretty good. Of course many little things can be improved to make the work better, but this is not the place. With this project we tried to cover a big slice of the topics faced during the course while proposing a very recent idea and facing its implementation from scratch.
Some improvements (i.e.): other more complex classifiers could be used in the final section, some parameters like \(\beta\) and \(M\) could be setted with a grid or a random search, more sensors (maybe) could be involved (but different mobiles are very inconsistent on the other sensors) and so on.
Thank you for the course!